
Homework: Analysis of the Haunted Places Dataset

Due: Friday, March 14, 2025 12pm PT

1. Overview


Figure 1: The Haunted Places dataset: reports of strange sightings in and around the United States, consisting of 21,983 rows x 10 columns. The full dataset can be found at https://www.kaggle.com/datasets/sujaykapadnis/haunted-places.

In this assignment we will explore several of the topics discussed in the early portion of class – Big Data, MIME types and their taxonomy, data similarity, and so forth. To do this, we will leverage the dataset highlighted in Figure 1 – a set of 21,983 reported mysterious haunted places. Each record contains several features, which are highlighted below:

City - The city in which the haunted place resides.

Country - The country where the place is located (always "United States").

Description - A text description of the place. The amount of detail in these descriptions is highly variable.

Location - A title for the haunted place.

State - The US state where the place is located.

State_abbrev - The two-letter abbreviation for the state.

Longitude - Longitude of the place.

Latitude - Latitude of the place.

City_longitude - Longitude of the city center.

City_latitude - Latitude of the city center.

The Haunted Places dataset is a rich dataset with high variation in its features and properties. For example, as can be seen from Figure 2, there are 4,386 unique cities and 9,904 unique locations in the dataset. These locations are always somewhere in the United States, including Alaska and Hawaii.

Figure 2. Distribution of features in the Haunted Places dataset.

One of the other important elements of the Haunted Places dataset is the description of the sighting. There are roads, streets, homes, commercial buildings, and other potentially haunted areas, with high variation. The descriptions of the haunted places give the reader some indication of the modality of why the place is haunted, such as “If you take a camera…[sic] and take pictures, you’ll see orbs everywhere”, or “during the constructions of the train trestle, two men were found inexplicably hanging by suspension cords used to lift steel beams. Passersby have noted the sound of snapping necks…”, or “This street had so much activity that the neighbors were comparing stories that at night they experienced that beings were laying down on top of them”, and so on.

These descriptions have rich modalities in the way the events were observed (sound, sight, smell, etc.). Additionally, some describe events that occurred (a murder?) whereas others describe supernatural objects inhabiting dark areas (“orbs” near trees). Others describe eyewitness accounts of what happened, and still others describe temporal characteristics that influence the evidence, such as happening during the day, at night, or in the early evening. Some are repeating events, while others occurred only once, maybe twice. And some of the events have multiple witnesses, whereas some are described as a story from a single witness or an “ancient tale”.

Geographic area and proximity to the city center are also useful features. Did this sighting occur far out in the woods, away from a city center where people are? Does that affect the witness reports? Or was this an event that happened right in a metropolitan area? How should that affect how you weigh the evidence?

Take the cardinality of witnesses, for example. Are single-witness sightings more likely to have a follow-up performed? You could use the number-parser Python tool (https://github.com/scrapinghub/number-parser) to extract the numerical value of the number of witnesses, as an example.
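To make the extraction idea concrete, here is a minimal stdlib sketch of pulling a witness count out of free text. The word table and function name below are illustrative only; number-parser handles the general word-to-number conversion that this toy table does not.

```python
import re

# Tiny word-to-number table for illustration; number-parser covers the
# general case ("twenty-three", "a dozen", etc.). "several" -> 3 is an
# arbitrary convention chosen for this sketch.
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4,
                "five": 5, "several": 3, "a": 1}

def witness_count(description: str) -> int:
    """Return an estimated witness count from free text, 0 if none found."""
    match = re.search(r"(\w+)\s+witness(es)?", description.lower())
    if not match:
        return 0
    token = match.group(1)
    if token.isdigit():
        return int(token)
    return WORD_NUMBERS.get(token, 0)
```

For example, `witness_count("two witnesses heard the noise")` yields 2, while text that never mentions witnesses falls back to 0, mirroring the default rule used later in the tasks.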

Additional insight can be gleaned from exploring the temporal properties of sightings, which are semi-structured and present only in the description. However, using a tool such as https://github.com/akoumjian/datefinder you can easily discern whether there is information about when these sightings occurred, provided the information is extractable. Are there particular days of the week with sightings? What happens if you slice sightings by geographic area and date?
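The shape of that extraction can be sketched with the stdlib. This regex stand-in only catches explicit four-digit years and is not datefinder itself, which recognizes a much wider range of date-like strings and returns full datetime objects; the helper name is illustrative.

```python
import re
from datetime import datetime
from typing import Optional

def first_year_mentioned(description: str) -> Optional[datetime]:
    """Return a datetime for the first 4-digit year in the text, else None.

    In a real pipeline, datefinder.find_dates(description) would replace
    this regex and also recover month/day information when present.
    """
    match = re.search(r"\b(1[6-9]\d{2}|20\d{2})\b", description)
    if match:
        return datetime(int(match.group(1)), 1, 1)
    return None
```

Once dates are normalized to datetime objects, slicing by weekday or by (state, year) becomes a simple group-by over the extracted column.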

2. Objective

Exploring the Haunted Places dataset may lead you to ask several questions to build up a profile of evidence related to the event. You can group these questions into the following characteristics:

Haunted Place Modality

-          Evidence

o    audio evidence? (As discerned by “noises”, “sound of snapping neck”, “nursery rhymes”)

o    image/video or visual evidence? (As discerned by keywords in the description like “cameras” and “take pictures”, “names of children written on walls”, etc.)

-          Is  it  haunted  due  to  some  event? (“When the railroad was built,  happened”)

Temporal characteristics

- Can time of day be discerned? Evening, afternoon, morning?

- Can date be discerned? (Year, month, day?)

- If the date/time is not directly present, can you find it by searching through description or location name on Google?

- If not, perhaps set year to 2025, day to 01 and month to 01

Apparition type

- Is it a ghost?

- Is there a ghost description (male/female/child?)

- Is it an orb? Unidentified Flying Object  (UFO)? Unidentified Aerial Phenomena (UAP)?

Event type

- Did someone die here? Murder? Natural?

- How many people?

- How many witnesses?

Figure 3: Evidentiary questions you could ask about the Haunted Places dataset, deriving features that aren’t initially present.

What ways could you explore this, by looking beyond this rich dataset and by applying lessons learned from class thus far, where we have been studying the 5 V’s, MIME types of associated datasets, and large datasets and their characteristics? Could you join, for example, the alcohol abuse by state data from https://drugabusestatistics.org/alcohol-abuse-statistics/ and then determine whether there is a correlation between population-level abuse of alcohol and the presence of these Haunted Places? What about a dataset that gave you information about visibility at that city/state/location during all times of the year, to determine the quality of the visual evidence and observations taken by the reporters of these Haunted Places? There are multiple datasets providing evidence of the amount of daylight per state, such as https://www.timeanddate.com/astronomy/usa and https://aa.usno.navy.mil/data/Dur_OneYear.

What about police reports? Does the number of police reports in a community have any effect on the connection between the community and murders and other violent crime that may suggest a correlation with these supernatural events? For example, you could look at violent crime and other police reports by state with this dataset: https://projects.csgjusticecenter.org/tools-for-states-to-address-crime/50-state-crime-data/.

And finally, a website, TarotCards.io, recently studied and revealed the most Haunted states by Google search volume and published the following statistics:

Key Findings:

Most Haunted State (Total Search Volume): California takes the top spot with 2,860 average monthly searches related to hauntings, followed by Texas (2,680) and New York (2,150).

Least Haunted State (Total Search Volume): Alaska ranks last, with just 570 searches per month.

Most Haunted State (Per Capita): Wyoming leads the way, with 98.89 searches per 100,000 residents, followed closely by Vermont (92.62) and North Dakota (81.12).

Least Haunted State (Per Capita): California is the least haunted by search interest when adjusted for population size, with only 7.25 searches per 100,000 people.

The Most Haunted States by Total Search Volume

California – 2,860 searches

Texas – 2,680 searches

New York – 2,150 searches

Florida – 2,080 searches

Ohio – 2,070 searches

The Least Haunted States by Total Search Volume

46. Hawaii – 650 searches

47. North Dakota – 640 searches

48. Vermont – 600 searches

49. Wyoming – 580 searches

50. Alaska – 570 searches

The Most Haunted States Per Capita (Per 100,000 Inhabitants)

Wyoming – 98.89 searches

Vermont – 92.62 searches

North Dakota – 81.12 searches

Alaska – 77.71 searches

South Dakota – 77.52 searches

The Least Haunted States Per Capita (Per 100,000 Inhabitants)

46. Pennsylvania – 13.59 searches

47. New York – 10.86 searches

48. Florida – 9.05 searches

49. Texas – 8.56 searches

50. California – 7.25 searches

The full dataset can be found here: https://docs.google.com/spreadsheets/d/1-ok5MWfRfGpO2nJkL3zcyDaw-PEEFKGOgiynJ3YiXS0/edit?gid=1333584719#gid=1333584719.

What other features can you think of that would be useful to join to the Haunted places dataset?

You will choose at least three additional publicly accessible datasets along these lines to join the Haunted Places dataset to, and you must add at least three new features per dataset that you join. The datasets you select may not all belong to the same MIME top level type – that is – you must pick a different MIME top level type for each of the three datasets you are joining to this Haunted Places dataset.

Once the data is joined properly, you will explore the combined dataset using Apache Tika and an associated Python library called Tika-Similarity. Using Tika-Similarity, you can evaluate data similarity (as discussed during the deduplication lecture in class, and also during the data forensics discussions). Tika-Similarity will allow you to explore and test different distance metrics (edit distance, Jaccard similarity, cosine similarity, etc.). It will give you an idea of how to cluster data, and finally it will let you visualize the differences between clusters in your new combined dataset. So you can figure out how similar Haunted Places are, given their locations, descriptions, witnesses, apparition types, geographic proximity to certain towns, and other features, and explore your new augmented dataset. For example, you may ask: how many Haunted Places were actually ghost sightings that occurred with low light visibility in the Fall season, near an airport, with reports from at least two witnesses?

The assignment specific tasks will be specified in the following section.

3. Tasks

1.   Download and install Apache Tika

a.   Chapter 2 in your book covers some of the basics of building the code, and additionally, see https://tika.apache.org/2.9.1/index.html

b.   Install Tika-Python, you can pip install tika to get started.

i.   Read up on Tika Python here: http://github.com/chrismattmann/tika-python

2.   Download the Haunted Places dataset

a.   We will provide you a Dropbox link in Slack for each validated team

b.   Make a copy of the original dataset (because you are going to modify/add to it in this assignment)

3.   Create a combined TSV file for your Haunted Places dataset

a.   Convert the CSV to TSV (here’s a simple example of how to do this with Python: https://unix.stackexchange.com/questions/359832/converting-csv-to-tsv)
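The conversion in step 3a can be done with Python’s csv module, which correctly handles description fields containing commas inside quotes. A minimal sketch (the file paths are placeholders, not the dataset’s real filenames):

```python
import csv

def csv_to_tsv(csv_path: str, tsv_path: str) -> None:
    """Re-emit a CSV file as TSV, preserving quoted fields correctly."""
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(tsv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, delimiter="\t")
        for row in csv.reader(src):
            writer.writerow(row)
```

Using csv.reader rather than str.split(",") matters here because Haunted Place descriptions routinely contain commas, which a naive split would shred into extra columns.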

4.   Add and expand the dataset with the following features

a.   Add a new feature called “Audio Evidence”. Set it to True if you match text like “noises”, “sound of snapping neck”, “nursery rhymes”; False otherwise.

b.   Add a new feature called “Image/Video/Visual Evidence”. Set it to True if you match text like “cameras” and “take pictures”, “names of children written on walls”, etc.

c.   Add a new feature called “Haunted Places Date” and use https://github.com/akoumjian/datefinder on the Description text to pull out dates. If you can’t find them, set the date to 2025/01/01.

d.   Add a new feature called “Haunted Places Witness Count” and use Number Parser (https://github.com/scrapinghub/number-parser) to obtain the number of witnesses (if possible to identify in the description). If unable to identify, set the count to 0.

e.   Add a new feature called “Time of Day”, and try to discern “Evening”, “Morning”, or “Dusk” from the text. If not discernable, set to “Unknown”.

f.   Add a new feature called “Apparition Type” and discern from the description whether it is a “Ghost”, “Orb”, “UFO”, “UAP”, or what type of apparition the Haunted Place is inhabited by. Perhaps it is a “Male”, “Female”, “Child”, or “Several Ghosts”. Parse this from the description and set to “Unknown” if not discernable.

g.   Add a new feature called “Event type”. Was it a murder? Did someone die here? Was it a supernatural phenomenon? Discern this by parsing the “Description” text and searching for keywords.

h.   Join the Alcohol Abuse by State dataset, here: https://drugabusestatistics.org/alcohol-abuse-statistics/

i.   Join the amount of daylight by state dataset, here:

i. https://www.timeanddate.com/astronomy/usa

ii. https://aa.usno.navy.mil/data/Dur_OneYear
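Features 4a and 4b above reduce to case-insensitive keyword matching over the Description text. A minimal sketch of that matcher follows; the keyword tuples contain only the examples given in the task and would need expanding against the real data, and the function name is illustrative:

```python
# Seed keyword lists taken from the task description; expand as needed.
AUDIO_KEYWORDS = ("noises", "sound of snapping neck", "nursery rhymes")
VISUAL_KEYWORDS = ("camera", "take pictures",
                   "names of children written on walls")

def has_keywords(description: str, keywords) -> bool:
    """True if any keyword appears (case-insensitively) in the text."""
    text = description.lower()
    return any(kw in text for kw in keywords)
```

In the pipeline, the “Audio Evidence” column would then be `has_keywords(desc, AUDIO_KEYWORDS)` per row, and likewise for the visual-evidence column.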

5.   Identify at least three other datasets, each of a different top-level MIME type (they can’t all be, e.g., text/*)

a.   Check out places including: https://catalog.data.gov/dataset (Data.gov)

b.   For each dataset, develop a Python program to join the data to your new Haunted Places dataset

c.   For each non text/* dataset, be prepared to describe how you featurized the dataset

d.   Each dataset that you join must contribute at least three features (in addition to the features described in part 4)

e.   For each feature you add, be prepared to discuss what types of queries it will allow you to answer and also how you computed the feature

6.   Download and install Tika-Similarity

a.   Read the documentation

b.   You can find Tika Similarity here (http://github.com/chrismattmann/tika-similarity)

c.   You will also need to install ETLLib, here (http://github.com/chrismattmann/etllib)

d.   Convert the TSV dataset into JSON using ETLLib’s tsv2json tool

e.   Compare Jaccard similarity, edit-distance, and cosine similarity using Tika Similarity

f.   Compare and contrast clusters from Jaccard, Cosine Distance, and Edit Similarity – do you see any differences? Why?

g.   How  do  the  resultant  clusters  generated  highlight  the  features  you extracted? Be prepared to identify this in your report.
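Tika-Similarity implements the three metrics compared in step 6e; to make the comparison concrete, here is what each one computes, sketched self-contained with the stdlib (token-level Jaccard and cosine, character-level edit distance — this is not the Tika-Similarity API):

```python
import math
from collections import Counter

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets: |A∩B| / |A∪B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def cosine(a: str, b: str) -> float:
    """Cosine similarity of the token-frequency vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic rolling-row DP."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = curr
    return prev[len(b)]
```

The differences in clustering behavior follow from these definitions: Jaccard ignores term frequency, cosine weights repeated terms, and edit distance is sensitive to spelling and word order — which is one reason the three metrics can group the same Haunted Place descriptions differently.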

7.   Package your data up by combining all of your new JSONs with additional features into a single TSV (tab separated values) file where the columns represent the features and the rows are the instances of your sightings.

8.   (EXTRA CREDIT) Add some new D3.js visualizations to Tika Similarity

a.   Currently Tika Similarity only supports Dendrogram, Circle Packing, and combinations of those to view clusters, and relative similarities between datasets

b.   Download and install D3.js

i.   Visit http://d3js.org/

ii.   Review Mike Bostock’s Visual Gallery Wiki

iii.   https://github.com/mbostock/d3/wiki/Tutorials

iv.   Consider adding

1.   Feature related visualizations, e.g., time series, bar charts, plots

2.   Add functionality in a generic way that is not specific to your dataset

3.   See gallery here: https://github.com/d3/d3/wiki/Gallery

4.   Contributions will be reviewed as Pull Requests on a first-come, first-served basis (check existing PRs and make sure you aren’t duplicating what some other group has done)

4. Assignment Setup

4.1 Group Formation

You can work on this assignment in groups of at least 2 and at most 6 members. You may reuse your existing groups from discussion in class. Please fill out the group details in the form provided after class. Only one form submission per team. If you have any questions, contact your TA/Course Producer via their email address with the subject: DSCI 550: Team Details.

4.2 Haunted Places dataset

Access to the data is provided by a Dropbox link. The dataset itself is approximately 5.4 MB unzipped. You may want to distribute the data among your teammates since the data is fairly small (for now).

4.3 Downloading and Installing Apache Tika

The quickest and best way to get Apache Tika up and running on your machine is to grab the tika-app jar from http://tika.apache.org/download.html. You should obtain a jar file called tika-app-2.9.1.jar. This jar contains all of the necessary dependencies to get up and running with Tika by calling it from your Java program.

Documentation is available on the Apache Tika webpage at http://tika.apache.org/. API documentation can be found at http://tika.apache.org/2.9.1/api.

Since you will be using Tika Python, you will want to read up on the Tika REST API, here: https://cwiki.apache.org/confluence/display/TIKA/TikaServer. The Tika Python library is a robust REST client to the Java-side REST API.

You can also get more information about Tika by checking out the book written by Professor Mattmann called “Tika in Action”, available from: http://manning.com/mattmann/.

5. Report

Write a short 4-page report describing your observations, i.e., what you noticed about the dataset as you completed the tasks. What questions did your newly joined datasets allow you to answer about the Haunted Places data, its sightings, and additional features that were previously unanswerable? What clusters were revealed? Which similarity metrics produced (in your opinion) more accurate groupings? Why? What did the additional datasets suggest about “unintended consequences” related to Haunted Places? You should also clearly explain which datasets you joined to the Haunted Places data and how you extracted the new features from each dataset.

Thinking more broadly, do you have enough information to answer the following:

1. Are there clusters of Haunted Places with similar features, all of which are murders occurring in the evening?

2. Does the time of day of the original Haunted Place sightings matter?

3. Are specific locations more likely to be influenced by alcohol abuse, causing more Haunted Places to be reported?

4. Are specific keywords bigger indicators of the apparition type related to a Haunted Place?

5. Is there a set of frequently co-occurring features that defines a particular Haunted Place?

6. What insights do the “indirect” features you extracted give us about the data?

7. What clusters of Haunted Places made the most sense? Why?

Also include your thoughts about Apache Tika – what was easy about using it? What wasn’t?

Note: Report should be written using 11 pt Times New Roman font, single column with single spacing.

6. Submission Guidelines

This assignment is to be submitted electronically, by 12pm PT on the specified due date, via Gmail [email protected] for the Thursday class, or [email protected] for the Tuesday class. Use the subject line: DSCI 550: Mattmann: Spring 2025: BIGDATA Homework: Team XX. So, if your team was team 15, and you had the Thursday class, you would submit an email to [email protected] with the subject “DSCI 550: Mattmann: Spring 2025: BIGDATA Homework: Team 15” (no quotes). Please note only one submission per team.

●          All source code is expected to be commented, to compile, and to run. You should have at least a few Python scripts that you used to join the three other datasets and to extract additional features.

●          Use relative paths {not absolute paths} when loading your data files so that we can execute your script/notebook files without changing everything.

●          If using a notebook environment, use markdown cells to indicate which tasks/questions you are solving.

●          Include your updated dataset TSV. We will provide a Dropbox or Google Drive location for you to upload to {you don't need to attach it inside the zip file}.

●          Also prepare a readme.txt containing any notes you’d like to submit.

●          If you used external libraries other than Tika Python and Tika Similarity, you should include those libraries in your submission, and include in your readme.txt a detailed explanation of how to use these libraries when compiling and executing your program.

●           Save your report as a PDF file (TEAM_XX_BIGDATA.pdf) and include it in your submission.

●          Compress all of the above into a single zip archive and name it according to the following filename convention:

TEAM_XX_DSCI550_HW_BIGDATA.zip

Use only standard zip format. Do not use other formats such as zipx, rar, ace, etc.

●          If your homework submission exceeds Gmail’s 25MB limit, upload the zip file to Google Drive and share it with [email protected] (Thursday class) or [email protected] (Tuesday class).

When submitting, please organize your code and data files in the directory structure shown:

Data

dataset1 {leave it empty for now}

Source Code

script1

notebook

Readme.txt

Requirements.txt

Important Note:

●          Make sure that you have attached the file when submitting. Failure to do so will be treated as non-submission.

●          Successful submission will be indicated in the assignment’s submission history. We advise that you check to verify the timestamp, and download and double-check your zip file for good measure.

●          Again, please note, only one submission per team. Designate someone to submit.

6.1 Late Assignment Policy

●          -10% if submitted within the first 24 hours

●          -15% for each additional 24 hours or part thereof

