2023-04-25 - Scraping


main topics

Parsing files in the following formats:


CSV

reading from a CSV file

require 'csv'

file = 'beatles.csv'
# "First Name","Last Name","Instrument"
# "John","Lennon","Guitar"
# "Paul","McCartney","Bass Guitar"
# "George","Harrison","Lead Guitar"
# "Ringo","Starr","Drums"

CSV.foreach(file, 'r', headers: :first_row) do |row|
  full_name = "#{row['First Name']} #{row['Last Name']}"
  instrument = row['Instrument']
  puts "#{full_name} plays #{instrument}"
end
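
When the CSV content is already in memory (for example, downloaded text), CSV.parse does the same job as CSV.foreach without a file. A minimal sketch, assuming a hypothetical helper named parse_beatles and made-up sample data:

```ruby
require 'csv'

# Hypothetical helper (not part of the CSV library): turns CSV text
# with a header row into an Array of Hashes keyed by column name.
def parse_beatles(csv_text)
  CSV.parse(csv_text, headers: true).map(&:to_h)
end

# Sample data for illustration only
csv_text = <<~TEXT
  First Name,Last Name,Instrument
  John,Lennon,Guitar
  Ringo,Starr,Drums
TEXT

parse_beatles(csv_text).each do |beatle|
  puts "#{beatle['First Name']} #{beatle['Last Name']} plays #{beatle['Instrument']}"
end
```

Converting each row to a Hash makes the data easy to pass on to JSON.generate later.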

storing in a CSV file

require 'csv'

beatles = [
  ["John","Lennon","Guitar"],
  ["Paul","McCartney","Bass Guitar"],
  ["George","Harrison","Lead Guitar"],
  ["Ringo","Starr","Drums"]
]

file = 'beatles.csv'

CSV.open(file, 'wb') do |csv|
  # NOTE: each row is appended as an Array; fields are written as Strings
  csv << ['First Name', 'Last Name', 'Instrument']
  beatles.each { |beatle| csv << beatle }
end
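
To get the CSV text as a String instead of writing a file, CSV.generate accepts the same `<<` calls. A minimal sketch, assuming a hypothetical helper named to_csv_text; it also shows that non-String fields (here Integers) are converted when written:

```ruby
require 'csv'

# Hypothetical helper (not part of the CSV library): builds CSV text
# in memory from a header row and an Array of rows.
def to_csv_text(header, rows)
  CSV.generate do |csv|
    csv << header
    rows.each { |row| csv << row }
  end
end

# The Integer fields are converted to text in the output
puts to_csv_text(['Name', 'Born'], [['John Lennon', 1940], ['Paul McCartney', 1942]])
```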

JSON

parsing a JSON string

Just use JSON.parse(json_payload). Example:

require 'json'

file = 'beatles.json'

# put file contents in a string
json_payload = File.read(file)

# parse JSON contents into a Ruby object
beatles = JSON.parse(json_payload)

# NOTES:
# 1. a JSON object becomes a Ruby Hash (with String keys by default)
# 2. a JSON array becomes a Ruby Array
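
JSON.parse returns String keys by default; passing symbolize_names: true gives Symbol keys instead. A minimal sketch, assuming a hypothetical helper named first_names and an inline payload in place of the file:

```ruby
require 'json'

# Hypothetical helper: parses a JSON payload shaped like beatles.json
# and returns the first name of each member.
def first_names(json_payload)
  data = JSON.parse(json_payload, symbolize_names: true)
  data[:beatles].map { |beatle| beatle[:first_name] }
end

# Inline sample payload for illustration only
json_payload = '{"title":"The Beatles","beatles":[{"first_name":"John"},{"first_name":"Paul"}]}'
p first_names(json_payload)
```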

converting a Ruby Hash into a JSON string

Just use JSON.generate(my_hash). Example:

require 'json'

# here's the Hash (the "key": syntax creates Symbol keys,
# which JSON.generate writes out as JSON strings)
beatles = {
  "title": "The Beatles",
  "beatles": [
    {
      "first_name": "John",
      "last_name": "Lennon",
      "instrument": "Guitar"
    },
    {
      "first_name": "Paul",
      "last_name": "McCartney",
      "instrument": "Bass Guitar"
    },
    {
      "first_name": "George",
      "last_name": "Harrison",
      "instrument": "Lead Guitar"
    },
    {
      "first_name": "Ringo",
      "last_name": "Starr",
      "instrument": "Drums"
    }
  ]
}

file = 'beatles.json'

# use a different block variable name so it does not shadow the outer `file`
File.open(file, 'wb') do |f|
  json_payload = JSON.generate(beatles)
  f.write(json_payload)
end
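
JSON.generate writes everything on one line. For a file meant to be read by people, JSON.pretty_generate produces indented output. A minimal sketch, assuming a hypothetical helper named to_pretty_json:

```ruby
require 'json'

# Hypothetical helper: serializes a Hash as indented, human-readable JSON.
def to_pretty_json(hash)
  JSON.pretty_generate(hash)
end

band = { title: 'The Beatles', members: ['John', 'Paul'] }
puts to_pretty_json(band)
```

Parsing the result back with JSON.parse yields String keys, even though the original Hash used Symbols.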

XML / HTML

Note: requires installing the nokogiri gem. See https://nokogiri.org/.

Basically this:

require 'nokogiri'
require 'open-uri' # provides URI.open for HTTP(S) URLs

def get_html_doc(url)
  html = URI.open(url).read
  Nokogiri::HTML.parse(html)
end

doc = get_html_doc('https://meleu.sh')

# then you can use #search with a CSS Selector
puts doc.search('h1').text

See also top_movies.rb, a script that lists the top movies from IMDb.